Abstract

Recent advances in supervised salient object detection have resulted in significant performance on benchmark datasets. Training such models, however, requires expensive pixel-wise annotations of salient objects. Moreover, many existing salient object detection models assume that at least one salient object exists in the input image. Such an assumption often leads to less appealing saliency maps on background images, which contain no salient object at all. To avoid the requirement of expensive pixel-wise salient region annotations, in this paper, we study weakly supervised learning approaches for salient object detection. Given a set of background images and salient object images, we propose a solution toward jointly addressing the salient object existence and detection tasks. We adopt the latent SVM framework and formulate the two problems together in a single integrated objective function: saliency labels of superpixels are modeled as hidden variables and involved in a classification term conditioned on the salient object existence variable, which in turn depends on both global image and regional saliency features and the saliency label assignment. Experimental results on benchmark datasets validate the effectiveness of our proposed approach.

1. Introduction 

Attention: the previous version contained a mistake in the one-class SVM for salient object detection, since the kernel trick was used in the training phase only.

Salient object detection, deriving from classical human fixation prediction [16], aims to separate the entire salient object(s) that attract most of humans' attention in the scene from the background [25 ]. Driven by applications of saliency detection in computer vision, such as content-aware image resizing [1 ] and photo collection visualization [39], many computational models have been proposed in the past decade.

There are two main motivations behind this paper. On one hand, recent advances in supervised salient object detection have resulted in significant performance on benchmark datasets [20]. Yet it is time consuming and tedious to annotate salient objects in order to train a model. On the other hand, most existing salient object detection algorithms assume that at least one salient object exists in the input image (see [3]). However, as shown in Fig. 1, there exist background images [40], where there are no salient objects at all. Based on this impractical assumption, all three state-of-the-art approaches [32, 20, 48] produce inferior saliency maps on background images. To this end, we study how to utilize weakly labeled data to train salient object detection models. Given a set of background images and salient object images, where we only have annotations of salient object existence labels, our goal is to train a salient object detection model.

Figure 1. Saliency maps produced by three state-of-the-art models on the background images.

In this paper, we propose a weakly supervised learning approach to jointly deal with salient object existence and detection problems. The input image is first segmented into a set of superpixels 1 . Saliency labels of superpixels (i.e., foreground or background) are then modeled as hidden variables in the latent structural SVM framework, where the inference can be efficiently solved using the graph cut algorithm [7]. The training problem is built upon the large-margin learning framework to separate the salient object images and the background images. Our proposed weakly supervised approach is based on a set of unsupervised methods [9, 32, 45, 48]. Compared with supervised approaches, we do not require strong pixel-wise salient object annotations. Furthermore, our approach is capable of recognizing the existence of salient objects.

1 In this paper, we use the terms superpixel and region interchangeably.

Our main contributions are therefore twofold: (i) we propose a weakly supervised learning approach based on the latent structural SVM framework, which does not rely on expensive salient object annotations; (ii) compared with conventional approaches, our proposed approach is capable of jointly addressing salient object existence and detection problems. Our approach performs better than most unsupervised salient object detection models and is comparable with the best supervised approach.

2. Related Work 

In this section, we briefly introduce related works in two areas: salient object detection and weakly supervised learning for vision tasks.

Salient object detection. We refer readers to [4, 6] for a comprehensive review of salient object detection models. Here, we briefly introduce some of the most related works.

Visual saliency is usually related to the uniqueness, distinctiveness, and disparity of the scene. Consequently, most existing works focus on designing models to capture the uniqueness of the scene in an unsupervised setting. The uniqueness can be computed for each pixel in the frequency domain [1], by comparing a patch to its most similar ones [15], or by comparing a patch to the average patch of the input image in the principal components space [28]. Benefiting from image segmentation algorithms, more and more approaches try to compute the regional uniqueness in a global manner [9, 32, ], based on multi-scale [19] and hierarchical segmentations of the image [A ]. Moreover, several priors about a salient object have been developed in recent years. Since a salient object is more likely to be placed near the center of the image to attract more attention (i.e., photographer bias), it is natural to assume that the narrow border of the image belongs to the background. Such a background prior is widely studied [42, 45, 18, 24]. It was recently extended to the background connectivity prior, which assumes that a salient object is less likely to be connected to the border area [48, 47]. In addition, a generic objectness prior is also utilized for salient object detection [8, 21, 17]. Other priors include spatial distribution [32, 10] and focusness [21].

There also exist supervised salient object detection models. The Conditional Random Field [25, 27] and Large-Margin framework [2 ] are adopted to learn the fusion weights of saliency features. Integration of saliency features can also be discovered based on the training data using Random Forest [20], Boosted Decision Trees (BDT) [29, 23], and mixture of Support Vector Machines [22].


Our proposed weakly supervised approach builds upon the feature engineering of several unsupervised approaches [9, 32, 45, 48]. Compared with supervised approaches, however, our approach does not rely on strong saliency annotations; we merely utilize the weak salient object existence labels of training images. Moreover, our proposed latent salient object detection approach (Sec. 3) is capable of jointly addressing the salient object existence and detection problems.

Salient object existence prediction. In [40], the salient object existence prediction problem is studied as a standard binary classification problem based on global saliency features of thumbnail images. Zhang et al. [46] investigate not only existence but also counting the number of salient objects based on holistic cues. In this paper, we focus on recognizing salient object existence. By incorporating the latent saliency labels of superpixels in our approach, better performance than [40] can be achieved. Moreover, salient object existence labels are used to train a weakly supervised salient object detection model, predicting superpixels' saliency scores.

Weakly supervised learning. Visual data that are ubiquitously available on the web are by nature weakly labeled, e.g., images on Flickr and videos on YouTube with tags. To leverage these data, weakly supervised learning methods are extensively studied for vision tasks such as object detection [31, 12], concept learning [36], scene classification [14, 31], semantic image segmentation [38], etc.

In essence, our proposed approach is closely related to the work of visual concept mining from weakly labeled data [35], where we label the test data based on strongly annotated negative training data. Compared with [35], our approach is more suitable for salient object detection. In addition, our latent salient object detection based on the latent structural SVM is closely related to the hidden [33] and max-margin [41] conditional random fields.

3. Weakly Supervised Salient Object Detection 

In this section, we first present a weakly supervised 
approach for salient object detection based on the latent 
structural SVM framework (Sec. 3.1). We then introduce 
saliency features used for salient object existence prediction 
and detection tasks (Sec. 3.2). 

3.1. A Latent Structural SVM Formulation 

In this paper, we are interested in learning a model that can predict not only whether there exist salient objects in the input image but also where the salient objects (regions) are (if they exist). Our weakly annotated training data is composed of a set of images and their ground-truth annotations of salient object existence labels (i.e., salient object images vs. background images). Unlike supervised approaches [20, 23], our approach does not need ground-truth annotations of regional saliency labels of the training samples. We call our approach weakly supervised salient object detection since the supervision comes merely from the salient object existence annotations. It is worth exploring weakly supervised learning since it requires far less annotation effort than a supervised one.

Denote the input image as $I$, which consists of $N$ superpixels $\{r_i\}_{i=1}^{N}$. The salient object existence label of the image is represented by a binary label $y \in \mathcal{Y}$, where $\mathcal{Y} = \{0, 1\}$ denotes whether there exist salient objects (0 for no existence). Regional saliency labels of the image are denoted as $\mathbf{h} = [h_i]_{i=1}^{N}$, where $h_i \in \mathcal{H} = \{0, 1\}$ indicates the saliency label for the superpixel $r_i$ (0 is for background).

Given a set of training samples $\{(I_m, y_m)\}_{m=1}^{M}$, our goal is to learn a model that can be used to predict the salient object existence label $y$ as well as regional saliency labels $\mathbf{h}$ of an unseen test image. To this end, we learn a discriminative function $f_{\mathbf{w}} : \mathcal{I} \times \mathcal{Y} \to \mathbb{R}$ over the image $I$ and its salient object existence label $y$, where $\mathbf{w}$ are the parameters. During testing, we can use $f_{\mathbf{w}}$ to predict the class label $y^*$ of the input image as $y^* = \arg\max_{y \in \mathcal{Y}} f_{\mathbf{w}}(I, y)$. Due to the lack of annotations, we model regional saliency labels $\mathbf{h}$ as hidden variables in the latent structural SVM framework. We assume $f_{\mathbf{w}}(I, y)$ takes the following form $f_{\mathbf{w}}(I, y) = \max_{\mathbf{h}} \langle \mathbf{w}, \Psi(I, y, \mathbf{h}) \rangle$, where $\Psi(I, y, \mathbf{h})$ is a feature vector depending on the input image $I$, its salient object existence label $y$, and regional saliency labels $\mathbf{h}$.

We consider the global features $\Phi^e(I)$ of the input image $I$ to capture the salient object existence in a holistic manner as in [40, 46]. Additionally, each superpixel $r_i$ is represented by two feature vectors $\Phi_i^f(I)$ and $\Phi_i^b(I)$, modeling its negative log-likelihood of belonging to the foreground and background, respectively. Their detailed definitions are introduced in Sec. 3.2. To account for the spatial constraint that two adjacent superpixels tend to share the same saliency label, we construct an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The vertex $j \in \mathcal{V}$ corresponds to the saliency configuration of the superpixel $r_j$ and $(j, k) \in \mathcal{E}$ indicates the spatial constraints of superpixels $r_j$ and $r_k$. Finally, $\langle \mathbf{w}, \Psi(I, y, \mathbf{h}) \rangle$ is defined as follows,

$$
\begin{aligned}
\langle \mathbf{w}, \Psi(I, y, \mathbf{h}) \rangle
={} & \sum_{a \in \mathcal{Y}} \delta(y = a) \, \langle \mathbf{w}_a^e, \Phi^e(I) \rangle \\
& + \sum_{a \in \{0,1\}} \delta(y = a) \sum_{j \in \mathcal{V}} \delta(h_j = 1) \left( \langle \mathbf{w}_a^f, \Phi_j^f(I) \rangle + w_a^f \right) \\
& + \sum_{a \in \{0,1\}} \delta(y = a) \sum_{j \in \mathcal{V}} \delta(h_j = 0) \left( \langle \mathbf{w}_a^b, \Phi_j^b(I) \rangle + w_a^b \right) \\
& - \sum_{(j,k) \in \mathcal{E}} \delta(h_j \neq h_k) \, w^p \cdot v_{jk}.
\end{aligned}
\tag{1}
$$

The model parameters $\mathbf{w}$ are the concatenation of the parameters of all the factors in the above equation, i.e., $\mathbf{w} = [\mathbf{w}_a^e, \mathbf{w}_a^f, \mathbf{w}_a^b, w_a^f, w_a^b, w^p]_{a \in \mathcal{Y}}$, where the scalars $w_a^f$ and $w_a^b$ are two prior terms for each region to be foreground and background, respectively.

In the above formulation, both salient object existence prediction and detection problems are modeled together in a single integrated objective function. The salient object existence label depends not only on the global image features $\Phi^e(I)$ in a standard classification term, but also on the regional saliency labels $\mathbf{h}$ and features $\Phi_j^f(I)$ and $\Phi_j^b(I)$. Although we are at the same supervision level as existing supervised models of predicting salient object existence labels [40, 46], regional saliency labels are taken into consideration as latent variables in our approach.

In turn, regional saliency labels $\mathbf{h}$ are dependent on the salient object existence label $y$ as well. We learn two groups of model parameters for salient object detection on salient object images and background images, respectively. Moreover, we learn two prior terms $w_a^f$ and $w_a^b$ modeling the influence of the salient object existence label $y$ on the latent salient object detection $\mathbf{h}$. The last smoothness term encourages adjacent regions to take the same saliency label. $v_{jk}$ captures the similarity of two neighboring regions $r_j$ and $r_k$. It is defined as $v_{jk} = \exp\left( -\frac{\|\mathbf{c}_j - \mathbf{c}_k\|^2}{2\sigma_c} \right)$, where $\mathbf{c}_j$ is the average color vector of the superpixel $r_j$ and the parameter $\sigma_c$ is set manually.
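To make the structure of Eq. 1 concrete, the following minimal sketch evaluates the score for one fixed pair of labels $(y, \mathbf{h})$. The dictionary layout of the parameters (keys `"e"`, `"f"`, `"b"`, `"prior_f"`, `"prior_b"`, `"p"`) and the function name are hypothetical choices made for illustration only; the features and the similarities $v_{jk}$ are assumed to be precomputed.

```python
def joint_score(w, phi_e, phi_f, phi_b, edges, y, h):
    """Evaluate <w, Psi(I, y, h)> of Eq. 1 for one labeling (y, h).

    w      : dict of parameters, indexed by the existence label where needed
             (hypothetical layout, not taken from the paper)
    phi_e  : global feature vector of the image
    phi_f  : list of per-region foreground feature vectors
    phi_b  : list of per-region background feature vectors
    edges  : list of (j, k, v_jk) tuples for neighboring regions
    """
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    # Global existence term, using the weights tied to class y.
    score = dot(w["e"][y], phi_e)
    # Unary terms: foreground weights if h_j = 1, background weights otherwise.
    for j, label in enumerate(h):
        if label == 1:
            score += dot(w["f"][y], phi_f[j]) + w["prior_f"][y]
        else:
            score += dot(w["b"][y], phi_b[j]) + w["prior_b"][y]
    # Smoothness term: penalize neighbors (j, k) that take different labels.
    for j, k, v_jk in edges:
        if h[j] != h[k]:
            score -= w["p"] * v_jk
    return score
```

Because only the terms selected by $\delta(y = a)$ contribute, the function simply indexes the parameter groups by `y` instead of summing over `a`.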

3.2. Saliency Features 

In the past decade, researchers have mainly concentrated on designing various features to describe salient objects. Inspired by [' ], more and more research effort is spent at the region level. In this paper, we consider the following five kinds of regional saliency features.

Global contrast. As studied in [9, 32, 5], the more distinct a region is from others, the more salient it might be. Regional global contrast $\Phi_i^{gc}(I)$ is computed by comparing the region $r_i$ to others, where nearby regions are given larger weights to determine the contrast value.

Spatial distribution. It is also an extensively studied saliency feature [25, 32, 10, 5], indicating that the wider a region spreads over the image, the less salient it is. Following [32], we compute the spatial distribution $\Phi_i^{sd}(I)$ by computing spatial distances of the region $r_i$ to others, which are weighted by their appearance distances.

Backgroundness. Since the salient object is usually placed near the image center to attract more attention, the image borders $\mathcal{B}$ are more likely to belong to the background. Following [20], the regional backgroundness $\Phi_i^{bg}(I)$ is computed by examining the region $r_i$ with respect to $\mathcal{B}$ based on different appearance features.

Manifold ranking. In addition to directly comparing each region to the image border $\mathcal{B}$, a region's saliency score can also be defined based on its relevance to $\mathcal{B}$ via graph-based manifold ranking [45]. Following [4 ], we compute the ranking score for the region $r_i$ w.r.t. each side of the image border $\mathcal{B}$ and combine them together to get the final manifold ranking score $\Phi_i^{mr}(I)$.

Figure 2. Illustration of saliency features computed on the Lab color histogram channel. From left to right: (a) input images, (b) global contrast, (c) spatial distribution, (d) backgroundness, (e) manifold ranking, (f) boundary connectivity, and (g) final saliency maps.

Boundary connectivity. It is suggested in [48] that a salient region is less likely to be connected to the pseudo-background $\mathcal{B}$. To this end, the boundary connectivity score $\Phi_i^{bc}(I)$ of the region $r_i$ is defined as the ratio between its spanning area and the length along the image border.

For robustness, we compute saliency features on different appearance channels including average RGB, RGB histogram, average HSV, HSV histogram, average Lab, Lab histogram, and Local Binary Pattern (LBP) histogram. Feature distances are computed as the $\chi^2$ distance for histograms and as absolute Euclidean distance for others. Each dimension of the feature is normalized to the range [0, 1]. Finally, we concatenate these five feature descriptors $\Phi_i^s(I) = [\Phi_i^{gc}(I), \Phi_i^{sd}(I), \Phi_i^{bg}(I), \Phi_i^{mr}(I), \Phi_i^{bc}(I)]$. In total, we obtain a 35-dimensional feature vector. See Fig. 2 for examples of different saliency features. We refer readers to the original papers for more technical details.
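The two distance types mentioned above can be sketched as follows. The exact form of the $\chi^2$ distance (here the symmetric variant with a factor of 1/2 and a small `eps` guarding against empty bins) is an assumption, since the text does not spell it out; both function names are illustrative.

```python
import math


def chi2_distance(p, q, eps=1e-12):
    """Symmetric chi-square distance between two histograms (assumed variant)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def euclidean_distance(p, q):
    """Euclidean distance between two average-color vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For normalized histograms this variant stays in [0, 1], which fits the later per-dimension normalization step.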

Based on saliency features $\{\Phi_i^s(I)\}_{i=1}^{N}$, we adopt the same holistic manner as in [40] to capture the existence of salient objects. We resize the pixel-wise saliency map resulting from each appearance channel of $\{\Phi_i^s(I)\}_{i=1}^{N}$ to 300 × 300 and divide it into 5 × 5 grids, concatenating the average saliency value in each grid to form a global saliency feature vector $\Phi^{GS}(I)$. Additionally, we also consider the GIST descriptor [3 ] $\Phi^{GIST}(I)$, computed as a concatenation of averaged responses of 32 Gabor-like filters over a 4 × 4 grid. Finally, we get a 1387-dimensional (5 × 5 × 35 + 32 × 4 × 4) feature vector $\Phi^e(I) = [\Phi^{GS}(I), \Phi^{GIST}(I)]$ to capture salient object existence.
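The grid pooling and the dimension count above can be checked with a short sketch. It assumes each saliency map is already available as a square list of rows whose side is divisible by the grid size; `grid_pool` is an illustrative name, not a function from the paper.

```python
def grid_pool(saliency_map, grid=5):
    """Average a square 2-D map over a grid x grid layout of equal cells."""
    size = len(saliency_map)
    cell = size // grid  # assumes size is divisible by grid (300 / 5 = 60)
    pooled = []
    for gy in range(grid):
        for gx in range(grid):
            block = [saliency_map[r][c]
                     for r in range(gy * cell, (gy + 1) * cell)
                     for c in range(gx * cell, (gx + 1) * cell)]
            pooled.append(sum(block) / len(block))
    return pooled


# Dimension check: 35 maps (5 features x 7 channels) pooled to 25 values each,
# plus 32 filters averaged over a 4 x 4 grid for GIST.
n_maps = 5 * 7
total_dim = 5 * 5 * n_maps + 32 * 4 * 4  # 875 + 512 = 1387
```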

We also define $\Phi_i^f(I) = -\log(1 - \Phi_i^s(I))$ and $\Phi_i^b(I) = -\log(\Phi_i^s(I))$, which can be regarded as the negative log-likelihood of each region belonging to the foreground and background, respectively. Since $\Phi_i^s(I) \in [0, 1]$, $\Phi_i^f(I)$ increases as it rises while $\Phi_i^b(I)$ decreases, indicating a region is more likely to be categorized as foreground with larger saliency feature values.
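A minimal sketch of this transform is given below. The clipping constant `eps` is an assumption added here to keep the logarithm finite at feature values of exactly 0 or 1; the text does not say how these boundary cases are handled.

```python
import math


def unary_features(phi_s, eps=1e-6):
    """Map normalized saliency features in [0, 1] to the two unary feature vectors."""
    # Clip away from 0 and 1 so that both logarithms stay finite (assumed detail).
    clipped = [min(max(v, eps), 1.0 - eps) for v in phi_s]
    phi_f = [-math.log(1.0 - v) for v in clipped]  # grows with the saliency value
    phi_b = [-math.log(v) for v in clipped]        # shrinks with the saliency value
    return phi_f, phi_b
```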


4. Learning and Inference 

In this section, we introduce how to learn our model parameters $\mathbf{w}$ from training samples (Sec. 4.1) and how to infer both the salient object existence label $y$ and regional saliency labels $\mathbf{h}$ given a test image (Sec. 4.2).

4.1. Large Margin Learning 

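Given a set of training samples $\{(I_m, y_m)\}_{m=1}^{M}$, we find the optimal model parameters by minimizing the following regularized empirical risk [13],

$$
\min_{\mathbf{w}} L(\mathbf{w}) = \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{M} \sum_{m=1}^{M} R_m(\mathbf{w}),
\tag{2}
$$

where $\lambda$ controls the trade-off between the regularization term and the loss term. $R_m(\mathbf{w})$ is a hinge loss function defined as

$$
R_m(\mathbf{w}) = \max_{y, \mathbf{h}} \left( \langle \mathbf{w}, \Psi(I_m, y, \mathbf{h}) \rangle + \Delta(y_m, y, \mathbf{h}) \right) - \max_{\mathbf{h}} \langle \mathbf{w}, \Psi(I_m, y_m, \mathbf{h}) \rangle,
\tag{3}
$$

where the loss function $\Delta(y_m, y, \mathbf{h})$ is defined as follows

$$
\Delta(y_m, y, \mathbf{h}) = \delta(y_m \neq y) + \alpha(y_m, \mathbf{h}).
\tag{4}
$$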


The first term is the 0/1 loss widely used for multi-class classification. In addition, we introduce the second term to constrain the latent salient object segmentation. For a background image, its regional saliency labels should be all zeros. For a salient object image, we resort to the pseudo-background prior [20] to treat all the saliency labels of regions in the border area of the image as zeros. To this end, the second loss term can be written as

$$
\alpha(y_m, \mathbf{h}) = \delta(y_m = 0) \, \frac{1}{Z_0} \sum_{i=1}^{N} a_i \, \delta(h_i = 1) + \delta(y_m = 1) \, \frac{1}{Z_1} \sum_{r_i \in \mathcal{B}} a_i \, \delta(h_i = 1),
$$

where $a_i$ is the area of the region $r_i$. $Z_0$ and $Z_1$ are normalization terms to ensure $\alpha(y_m, \mathbf{h}) \in [0, 1]$.
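The loss of Eq. 4 can be sketched in a few lines. This is an illustration under stated assumptions: the normalizers are taken to be the total area of the constrained regions (all regions for a background image, border regions for a salient object image), which is one choice that keeps the second term in [0, 1]; the function and argument names are hypothetical.

```python
def label_loss(y_true, y_pred, h, areas, is_border):
    """Delta(y_m, y, h): 0/1 existence loss plus an area-weighted label penalty."""
    zero_one = 1.0 if y_pred != y_true else 0.0
    if y_true == 0:
        # Background image: every region is expected to be labeled 0.
        constrained = [True] * len(h)
    else:
        # Salient object image: only border regions are expected to be labeled 0.
        constrained = list(is_border)
    # Assumed normalizer: total area of the constrained regions.
    z = sum(a for a, c in zip(areas, constrained) if c)
    violated = sum(a for a, c, label in zip(areas, constrained, h)
                   if c and label == 1)
    alpha = violated / z if z > 0 else 0.0
    return zero_one + alpha
```

Since the penalty depends on each $h_i$ separately, it adds only to the unary terms of Eq. 1, which is why loss-augmented inference keeps the same structure as plain inference.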

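Eq. 2 can be efficiently minimized using the bundle optimization method [1 ], which iteratively builds an increasingly accurate piecewise quadratic approximation of the objective function $L(\mathbf{w})$ based on its sub-gradient $\partial L(\mathbf{w})$. We first define

$$
\begin{aligned}
\mathbf{h}^*_{y,m} &= \arg\max_{\mathbf{h}} \left( \langle \mathbf{w}, \Psi(I_m, y, \mathbf{h}) \rangle + \Delta(y_m, y, \mathbf{h}) \right), \quad \forall m, \ \forall y \in \mathcal{Y}, \\
y^*_m &= \arg\max_{y} \left( \langle \mathbf{w}, \Psi(I_m, y, \mathbf{h}^*_{y,m}) \rangle + \Delta(y_m, y, \mathbf{h}^*_{y,m}) \right).
\end{aligned}
\tag{5}
$$

The sub-gradient $\partial L(\mathbf{w})$ can then be computed as

$$
\partial L(\mathbf{w}) = \lambda \mathbf{w} + \frac{1}{M} \sum_{m=1}^{M} \left( \Psi(I_m, y^*_m, \mathbf{h}^*_{y^*_m, m}) - \Psi(I_m, y_m, \mathbf{h}^*_m) \right),
$$

where $\mathbf{h}^*_m = \arg\max_{\mathbf{h}} \langle \mathbf{w}, \Psi(I_m, y_m, \mathbf{h}) \rangle$ is the maximizer of the second term in Eq. 3. Given the sub-gradient $\partial L(\mathbf{w})$, the optimal model parameters can then be learned by minimizing Eq. 2 using the method in [1 ].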
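For intuition, the hinge term $R_m(\mathbf{w})$ of Eq. 3 can be evaluated by brute force on a toy problem. The sketch below enumerates all labelings, which is only feasible for a handful of regions and stands in for the max-flow solver used in the paper; `score` and `loss` are hypothetical callables supplied by the caller.

```python
from itertools import product


def hinge_risk(score, loss, y_true, n_regions):
    """R_m(w) of Eq. 3 by exhaustive search over y in {0, 1} and h in {0, 1}^N.

    score(y, h)        -> value of <w, Psi(I_m, y, h)>
    loss(y_true, y, h) -> value of Delta(y_m, y, h)
    """
    labelings = list(product((0, 1), repeat=n_regions))
    # Loss-augmented maximum over all (y, h).
    augmented = max(score(y, h) + loss(y_true, y, h)
                    for y in (0, 1) for h in labelings)
    # Best latent completion for the ground-truth existence label.
    best_true = max(score(y_true, h) for h in labelings)
    return augmented - best_true
```

The result is never negative when the loss is non-negative, because the ground-truth label is itself one of the candidates in the first maximum.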

4.2. Inference 

Given a test image $I$, we maximize Eq. 1 to jointly predict its salient object existence label $y^*$ and regional saliency labels $\mathbf{h}^*$ as follows,

$$
(y^*, \mathbf{h}^*) = \arg\max_{y \in \mathcal{Y}, \, \mathbf{h}} \langle \mathbf{w}, \Psi(I, y, \mathbf{h}) \rangle.
\tag{6}
$$

Since the search space $\mathcal{Y}$ of $y$ is small, we can iterate over all its possible values. Given any $y \in \mathcal{Y}$, we utilize the max-flow algorithm [7] to optimize Eq. 1 to get the optimal regional saliency labels.
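The outer loop of this procedure is sketched below. For brevity, the inner maximization over $\mathbf{h}$ is done by exhaustive enumeration rather than by max-flow, so the sketch only scales to a few regions; it is meant to show the control flow of Eq. 6, not the efficient solver. The callable `score` is a hypothetical stand-in for Eq. 1.

```python
from itertools import product


def joint_inference(score, n_regions, labels=(0, 1)):
    """Return (y*, h*, best score) of Eq. 6 for a callable score(y, h)."""
    best = None
    for y in labels:                                  # small search space for y
        for h in product((0, 1), repeat=n_regions):   # brute force, not max-flow
            s = score(y, h)
            if best is None or s > best[0]:
                best = (s, y, h)
    return best[1], best[2], best[0]
```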

During training, we have to solve the loss-augmented version of Eq. 6 (see Eq. 5). Fortunately, the loss of regional saliency labels can be incorporated into the unary term of Eq. 1. Therefore, we can again utilize the max-flow algorithm [7] for efficient inference.

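To output a saliency map, we diffuse the latent segmentation result of the salient object using the quadratic energy function [2( ] as follows,

$$
\mathbf{z} = \gamma (\mathbf{I} + \gamma \mathbf{L})^{-1} \mathbf{h},
\tag{7}
$$

where $\mathbf{z} = [z_i]_{i=1}^{N}$ and $z_i \in [0, 1]$ is the saliency value of the superpixel $r_i$. $\mathbf{I}$ is the identity matrix. $\mathbf{V} = [v_{ij}]$ and $\mathbf{D} = \mathrm{diag}\{d_{11}, \dots, d_{NN}\}$ is the degree matrix, where $d_{ii} = \sum_j v_{ij}$. $\mathbf{L} = \mathbf{D} - \mathbf{V}$ is the Laplacian matrix.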
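The diffusion step can be sketched without any linear algebra library by solving the system $(\mathbf{I} + \gamma \mathbf{L}) \mathbf{x} = \mathbf{h}$ with Gauss-Seidel sweeps and scaling by $\gamma$. The iterative solver, the iteration count, and the default value of `gamma` are choices made here for illustration; a direct solve would work equally well. Convergence relies on the system matrix being strictly diagonally dominant, which holds for non-negative similarities.

```python
def diffuse_labels(h, v, gamma=1.0, iters=200):
    """Approximate z = gamma * (I + gamma * L)^{-1} h, with L = D - V.

    h : list of binary region labels
    v : symmetric N x N similarity matrix as a list of rows
    """
    n = len(h)
    # Off-diagonal degree; any diagonal entry of V cancels out in L = D - V.
    degree = [sum(v[i][j] for j in range(n) if j != i) for i in range(n)]
    x = [float(label) for label in h]
    for _ in range(iters):
        for i in range(n):
            neighbor_sum = sum(v[i][j] * x[j] for j in range(n) if j != i)
            # Row i of (I + gamma * L) x = h, solved for x_i.
            x[i] = (h[i] + gamma * neighbor_sum) / (1.0 + gamma * degree[i])
    return [gamma * xi for xi in x]
```

With two regions of unit similarity and labels (1, 0), the hard labels are smoothed to roughly (2/3, 1/3) for `gamma = 1`.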

5. Experimental Results 
5.1. Setup 

The only background images publicly available in the literature are those of the thumbnail background image dataset [40]. Images in this dataset, however, are of low resolution (130 × 130). Since we are interested in images with common sizes (e.g., 400 × 300), this dataset is not suitable for our scenarios. To this end, we collect 6182 background images from the SUN dataset [43], the describable texture dataset [11], and the Flickr and Bing image search engines. We randomly sample 5000 background images to train our model and leave the other 1182 images for testing. Additionally, we randomly sample 5000 images from the MSRA10K dataset [9] for training and 1237 images for testing. In total, we have 10000 images for training and 2419 for testing.

Table 1. Taxonomy of different salient object detection algorithms based on supervision type and tasks that each method can solve. (Abbreviations unspvd. and spvd. denote unsupervised and supervised, respectively.)

methods      supervision      task                      pub. & year
SVO [ l]     unspvd.          detection                 ICCV 2011
CA [15]      unspvd.          detection                 CVPR 2010
CB [19]      unspvd.          detection                 BMVC 2011
RC [9]       unspvd.          detection                 PAMI 2015
SF [32]      unspvd.          detection                 CVPR 2012
LRK [ 4]     unspvd.          detection                 CVPR 2012
HS [44]      unspvd.          detection                 CVPR 2013
GMR [45]     unspvd.          detection                 CVPR 2013
PCA [28]     unspvd.          detection                 CVPR 2013
MC [ $]      unspvd.          detection                 ICCV 2013
DSR [24]     unspvd.          detection                 ICCV 2013
RBD [48]     unspvd.          detection                 CVPR 2014
DRFI [20]    spvd.            detection                 CVPR 2013
HDCT [23]    spvd.            detection                 CVPR 2014
GS [40]      spvd.            existence, localization   CVPR 2012
SOS [46]     spvd.            existence, counting       CVPR 2015
LSSVM        weakly spvd.     detection
             spvd. + latent   existence

For the salient object detection task, we evaluate our proposed approach (LSSVM) on the MSRA-B [21 ] and ECSSD [44] datasets with pixel-wise annotations. MSRA-B contains 5000 images with variations including natural scenes, animals, indoor scenes, etc. ECSSD contains 1000 semantically salient but structurally complex images, making it very challenging.

We compare our approach with 14 state-of-the-art salient object detection models, including 12 unsupervised methods and 2 supervised models, which are summarized in Tab. 1. Following the benchmark [6], for quantitative comparisons, we binarize a saliency map with a fixed threshold ranging from 0 to 255. At each threshold, we compute Precision and Recall scores. We can then plot a Precision-Recall (PR) curve. To obtain a scalar metric, we report the average precision (AP) score defined as the area under the PR curve. Additionally, we also report the Mean Absolute Error (MAE) scores between saliency maps and the ground-truth binary masks.
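The evaluation protocol can be sketched as follows for a single image, with the saliency map and the mask flattened to lists. Several details are assumptions made for this illustration: a pixel counts as salient when its value is at least the threshold (so threshold 0 marks every pixel), precision is set to 0 when no pixel is predicted salient, and the area under the curve is computed by trapezoidal integration over recall. The benchmark code may differ in these conventions.

```python
def pr_curve(saliency, mask, thresholds=range(256)):
    """(recall, precision) at each threshold; saliency in 0..255, mask in {0, 1}."""
    positives = sum(mask)
    curve = []
    for t in thresholds:
        pred = [s >= t for s in saliency]
        tp = sum(1 for p, m in zip(pred, mask) if p and m)
        n_pred = sum(pred)
        precision = tp / n_pred if n_pred else 0.0   # assumed convention
        recall = tp / positives if positives else 0.0
        curve.append((recall, precision))
    return curve


def average_precision(curve):
    """Area under the PR curve via the trapezoidal rule over sorted recall."""
    pts = sorted(curve)
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(pts, pts[1:]))


def mean_absolute_error(saliency, mask):
    """Mean absolute difference between the rescaled map and the binary mask."""
    return sum(abs(s / 255.0 - m) for s, m in zip(saliency, mask)) / len(mask)
```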

















Figure 3. Empirical analysis of our approach on the test set, MSRA-B, and ECSSD datasets. From top to bottom: (a)(b): accuracy of salient object existence prediction and AP scores of salient object detection versus different number of training images (M in Eq. 2), (c)(d): accuracy of salient object existence prediction and AP scores of salient object detection versus different settings of feature combinations. Legend: No Global Contrast, No Spatial Distribution, No Backgroundness, No Manifold Ranking, No Boundary Connectivity, Full.


5.2. Empirical Analysis of Our Approach 

Here we empirically analyze our proposed approach on 
the test set, MSRA-B and ECSSD datasets. In particular, we 
quantitatively study the performance of both salient object 
detection and salient object existence prediction tasks by 
varying the following parameters. 

Number of Training Images. As can be seen from Fig. 3(a), the latent structural SVM benefits from a larger number of training samples: the classification accuracy keeps increasing almost monotonically on all three datasets as more training images are used. However, according to Fig. 3(b), the performance of salient object detection does not always increase when more training samples are available. The reason might be that, in contrast to salient object existence, we only have indirect (weak) supervision during training to constrain the salient object segmentation results.

Feature Importance. To measure the importance of features, we remove each kind of feature in turn and observe the performance variations on both tasks. In terms of salient object existence prediction, according to Fig. 3(c), the feature importance differs across the three datasets. For instance, backgroundness is recognized as the most important on the test set while considered the least critical one on MSRA-B. Regarding the salient object detection task, according to Fig. 3(d), the ranking of feature importance is consistent on MSRA-B and ECSSD. Features, from the most important to the least important, are: boundary connectivity, global contrast, backgroundness, spatial distribution, and manifold ranking. It is worth noting that the full feature vector performs the best.

5.3. Salient Object Existence Prediction 

Here we quantitatively study our proposed approach in terms of the salient object existence prediction task. We compare our approach with four baselines, where we train a linear SVM, two non-linear SVMs (using the $\chi^2$ and rbf kernels, respectively), and a Random Forest using our global image features $\Phi^e(I)$. As we can see in Tab. 2, by considering latent variables, our proposed approach (LSSVM) can achieve higher accuracy than the linear SVM.


Table 2. Classification accuracy of different approaches on benchmark datasets. (Updated.)

                   Test Set   MSRA-B   ECSSD
linear SVM         90.20      87.84    75.20
$\chi^2$ SVM       93.14      90.80    81.80
rbf SVM            95.37      93.16    82.90
RF                 92.24      91.52    84.50
[40]               90.64      89.26    72.50
LSSVM              93.96      90.82    76.90
$\chi^2$ LSSVM     95.58      92.54    79.90


However, since the rbf SVM, $\chi^2$ SVM, and Random Forest are non-linear classifiers, they perform better than our linear approach in most cases. This suggests that our approach may further benefit from non-linearly transforming our global features (via a kernel function). Therefore, we train a non-linear version of LSSVM (denoted as $\chi^2$ LSSVM), where we use the explicit feature mapping [3' ] to transform $\Phi^e(I)$ to approximate the $\chi^2$ kernel. As can be seen in Tab. 2, benefiting from latent variables, its classification accuracy is higher than that of its baseline ($\chi^2$ SVM) on the test set and MSRA-B, although it is lower on ECSSD. Moreover, it achieves the highest classification accuracy on the test set.

Compared with the state-of-the-art approach in [40], our approach has two advantages: more powerful features and the incorporation of latent saliency information. Though a non-linear classifier (Random Forest) is utilized in [40], as we can see from Tab. 2, our approach has higher classification accuracy on all datasets. Moreover, compared with [40], our approach is able to jointly address salient object existence and detection problems.

5.4. Salient Object Detection 

In this section, we compare our LSSVM approach with other state-of-the-art salient object detection approaches. Our LSSVM approach is designed to address the limitation of conventional approaches, which impractically assume that at least one salient object exists in the input image. For fairer comparisons, we introduce a two-stage scheme. Specifically, we first predict the existence label of salient objects using the rbf SVM introduced in Sec. 5.3. If there are no salient objects, we output an all-black saliency map. Otherwise, we generate saliency maps using different approaches.

(a) input (b) SF [32] (c) GMR [45] (d) DSR [24] (e) RBD [48] (f) HDCT [23] (g) DRFI [20] (h) LSSVM

Figure 4. Qualitative comparisons of saliency maps produced by different approaches. From left to right: (a) input images, (b)-(g) saliency maps of state-of-the-art approaches, (h) saliency maps of our proposed approach LSSVM.

In addition to the MSRA-B and ECSSD benchmark datasets, we check the performance of different approaches on the test set consisting of 1237 salient object images and 1182 background images. Since ground-truth annotations of background images are all-black images, only MAE scores can be reported on the test set. See Tab. 3 and Fig. 5 for quantitative comparisons.

Since an all-black saliency map is generated for an input that is classified as a background image, precision and recall scores are all zeros at all thresholds but 0 (the recall score is 1 when the threshold is 0, indicating all pixels are recognized as salient). This is why PR curves become flat when the recall approaches 1.

We can see in Fig. 5 that the PR curves of our approach are higher than the others in most places. Accordingly, the linear version (LSSVM) outperforms other unsupervised and supervised approaches on both MSRA-B and ECSSD datasets in terms of AP scores. Augmented with the explicit $\chi^2$ kernel feature mapping, better performance can be achieved, indicating that the salient object existence and detection problems can be mutually beneficial by modeling them in a unified framework. Specifically, $\chi^2$ LSSVM performs better than the second best method by 6.8% (RBD) on MSRA-B and by 5.5% (DRFI) on ECSSD. While the MAE scores are not as superior as the AP scores, $\chi^2$ LSSVM is ranked as the third best on both MSRA-B and ECSSD datasets. The reason why it performs worse on the test set might be that our approach cannot always produce all-black saliency maps for background images as other methods do 2 .

2 Recall that we produce an all-black saliency map if the rbf SVM recognizes an input as a background image.


Table 3. AP and MAE scores compared with state-of-the-art approaches on different benchmark datasets, where supervised approaches are marked with bold fonts. The best three scores are highlighted with red, green, and blue fonts, respectively. (Updated.)

                   AP                  MAE
                   MSRA-B   ECSSD      MSRA-B   ECSSD    Test Set
rbfSVM + SVO       0.631    0.458      0.333    0.388    0.212
rbfSVM + CA        0.512    0.390      0.241    0.326    0.101
rbfSVM + CB        0.652    0.483      0.184    0.275    0.111
rbfSVM + RC        0.672    0.506      0.135    0.233    0.093
rbfSVM + SF        0.607    0.473      0.168    0.270    0.052
rbfSVM + LRK       0.680    0.483      0.207    0.295    0.118
rbfSVM + HS        0.631    0.479      0.153    0.258    0.104
rbfSVM + GMR       0.709    0.517      0.126    0.235    0.085
rbfSVM + PCA       0.666    0.468      0.185    0.282    0.080
rbfSVM + MC        0.701    0.509      0.142    0.247    0.101
rbfSVM + DSR       0.694    0.524      0.119    0.229    0.076
rbfSVM + RBD       0.732    0.530      0.113    0.226    0.080
rbfSVM + DRFI      0.732    0.548      0.129    0.231    0.101
rbfSVM + HDCT      0.707    0.502      0.148    0.250    0.112
LSSVM              0.748    0.573      0.129    0.237    0.086
$\chi^2$ LSSVM     0.780    0.578      0.123    0.231    0.097


In Fig. 4, we provide qualitative comparisons of our approach and other top performing approaches. As can be seen, our LSSVM approach can produce appealing saliency maps on images where salient objects touch the image border, although we utilize the background prior to extract regional saliency features and constrain the latent salient object detection. Moreover, on background images, our LSSVM approach generates nearly all-black saliency maps, clearly indicating that no salient object exists.

On a PC equipped with an Intel i7 CPU (3.4GHz) and 32GB RAM, it takes about 12h to train our approach using MATLAB code and 0.5h to train the rbf SVM using C++. In testing, it takes around 3s to extract features. Our approach takes 0.02s for joint inference of the existence label and saliency map. In contrast, it takes 0.21s for the rbf SVM to predict the salient object existence (excluding feature extraction) and RBD takes 0.3s to output a saliency map.

Figure 5. Precision-Recall curves of different approaches on MSRA-B and ECSSD benchmark datasets. (Updated.)

5.5. Limitations 

Sometimes our approach makes incorrect classifications between salient object images and background images. See Fig. 6 for some failure cases. In the top row, the bird is hiding in the leaves, where the cluttered background and complex structure of the bird make the salient object detection difficult even for a human being at first glance. In the bottom row, textures of the image produce inferior saliency features, resulting in an incorrect classification.

6. Discussion and Conclusion 

In this paper, we propose a weakly supervised learning approach for salient object detection based on the latent structural SVM framework using background images. Without any prior assumption of the existence of salient objects, our approach is capable of jointly dealing with salient object existence prediction and detection tasks. Experimental results on benchmark datasets validate the effectiveness of our approach.

As a potential application, if we could recognize a background image, we would no longer need to resort to complicated content-aware image resizing techniques (e.g., [2]). Instead, a standard bicubic interpolation method may be enough for background images such as those shown in Fig. 4.

Figure 6. Failure cases of our LSSVM approach. Top row is a salient object image that is incorrectly recognized as a background image. Bottom row is a background image mis-classified as a salient object image. From left to right: (a) input images, (b)(c) saliency features of boundary connectivity and manifold ranking on the Lab histogram channel, and (d) saliency maps produced by our LSSVM approach.

For future work, we plan to investigate more advanced global features, such as CNN features used in [46], to further increase the accuracy of classification of salient object images and background images.

Since most existing approaches focus on unsupervised and supervised scenarios, we hope our work will draw researchers' attention to weak supervision and to the value of background images. We will release our code and background images for further research.